Screen Shot 2022-12-13 at 2.59.28 PM.png

Establishing a Viable Workspace

In [2]:
# mounting my google drive
from google.colab import drive
drive.mount('/content/drive')
# changing the directory to my folder designated for this project
%cd /content/drive/My Drive/TU/SEMESTERS/f2022/data_final_proj
Mounted at /content/drive
/content/drive/My Drive/TU/SEMESTERS/f2022/data_final_proj
In [3]:
# importing important packages I will need 
import matplotlib.pyplot as plt
import pandas as pd
!pip install geopandas
import geopandas as gpd
import numpy as np
import requests
import seaborn as sns 
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting geopandas
  Downloading geopandas-0.12.2-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 5.1 MB/s 
Requirement already satisfied: shapely>=1.7 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.8.5.post1)
Requirement already satisfied: packaging in /usr/local/lib/python3.8/dist-packages (from geopandas) (21.3)
Requirement already satisfied: pandas>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from geopandas) (1.3.5)
Collecting pyproj>=2.6.1.post1
  Downloading pyproj-3.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     |████████████████████████████████| 7.8 MB 47.1 MB/s 
Collecting fiona>=1.8
  Downloading Fiona-1.8.22-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.6 MB)
     |████████████████████████████████| 16.6 MB 124 kB/s 
Collecting munch
  Downloading munch-2.5.0-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: setuptools in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (57.4.0)
Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (7.1.2)
Requirement already satisfied: certifi in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (2022.9.24)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (22.1.0)
Collecting cligj>=0.5
  Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Collecting click-plugins>=1.0
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.8/dist-packages (from fiona>=1.8->geopandas) (1.15.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (2022.6)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from pandas>=1.0.0->geopandas) (1.21.6)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging->geopandas) (3.0.9)
Installing collected packages: munch, cligj, click-plugins, pyproj, fiona, geopandas
Successfully installed click-plugins-1.1.1 cligj-0.7.2 fiona-1.8.22 geopandas-0.12.2 munch-2.5.0 pyproj-3.4.1

proj goals.jpg

  1. Determine whether the driver or the constructor plays a larger role in Formula One success

    • I will utilize a regression discontinuity framework to identify the effect that switching teams had to a driver's average lap time, pitstop time and fastest speed for a given year. If the driver switches to a 'better' team, it is expected that average lap time and pitstop time will decrease and average fastest speed will increase.
  2. Build a regression model that will predict how a driver's average finishing position will be affected if they switch teams between seasons.

    • I will utilize historical data on the current 10 Formula One teams to build a regression model that will predict how a driver's average finishing position for a will be affected if they switch teams. I will consider the following variables when building my model: lap times, pitstop times, average starting position, average finishing position, fastest lap speed and constructor reference information.

My motivations: F1 is one of the few sports that is almost completely reliant on data. Telemetry from the cars to the pits has been seen in races since the 1980s however, F1 data usage has greatly expanded since then. Dependent on the team, cars can be fitted with upwards of 300 sensors that generate over one million data points per second. Teams are limited in the number of employees that can be present on race day and thus, teams are utilizing the mass collection of data in real time and transmitting said data to analysts and data engineers at an off-site location for immediate feedback. Drivers are able to not only rely on their gut instinct and years of training, but now they can also rely on models and predictions in real time to make difficult decisions. This is why I would consider F1 analytics to be important. It is interesting to me because F1 has been rapidly expanding into the USA recently with social media promotions and deals with Netflix US and their kind of analytics would constitute my dream job and is ultimately why I decided to pursue this as my milestone topic.

Further Reading about Data in Formula One:

  1. Data Collection and Formula One Cars
  2. Amazon Web Services Partnership
  3. The Future of Formula One

DATA: DESCRIPTIONS, LOADING, SCRAPING, & CLEANING

Screen Shot 2022-12-13 at 3.12.01 PM.png

  1. circuits.csv : This dataset provides pertinent information about the different circuits within Formula One. A circuit is a track where a Formula One race occurs. Within this dataset the revelevant variables are: circuitId, circuitRef, location (city), location (country, latitude, longitude, and altitude.
  1. constructor_results.csv : Within Formula One, a constructor is the company or manufacturer that is responsible for the design, development, and construction of a team's car. This dataset summarizes how each constructor did at each race and how many points they received. Within this dataset, the revelant variables are: constructorResultId, raceId, constructorId, & points.
  1. constructor_standings.csv : This dataset highlights how each constructor did at specific races as well as the points their team received for said standings. This dataset is unique from constructors_results as it also includes information about a constructor did for the entire season.
  1. constructors.csv : This dataset gives information on the different constructors within Formula One, their specific ID, and nationality. The following variables are relevant: constructorId, constructorRef, & nationality.
  1. driver_standings.csv : This dataset illustrates the outcomes of each race. It shows how each driver did at specific races and how many points were awarded. The following variables are pertinent to our analysis: raceId, driverId, points, position, & wins (number of wins so far in the season).

  2. drivers.csv : This dataset gives detailed information on all Formula One drivers for the past 70+ years. Within this dataset, there is information on nationality as well as age. The following variables were important in our analysis: driverId, forename, surname & nationality.

  3. lap_times.csv : This dataset details each driver and how they performed each lap of every race. Each row signifies how a specific driver performed for each lap. We are given position after completing the lap as well as the time of lap. The following variables were used throughout my analysis: raceId, driverId, lap, milliseconds (time of lap in milliseconds), & position (driver position after lap).

  4. pit_stops.csv : This dataset provides information about all pitstops within the races. The pitstop dataset provides detailed information on stop length, what lap the stop occurred, and which driver was stopping. The following variables were pertinent to our analysis: raceId, driverId, stop (for a specific driver at a specific race, which stop was this), lap number (lap when the stop occurred), & milliseconds (time of the pitstop in milliseconds).

  1. qualifying.csv : To determine grid placement on race day, drivers compete in qualifying rounds on Saturday. This dataset details how each driver did at qualifiers, the overall time for each lap, and the placement earned for Sunday's race. The following variables were pertinent and used in my analysis: raceId, driverId, constructorId, position (where the driver would start based on their qualifying times), q1, q2, q3. More information about how qualifiers work in F1 can be found later in the analysis.
  1. races.csv : This dataset gives specific IDs to each race within Formula One between 1950-2022. The dataset has information on the circuit, what date the race was on, and even what time the race began. The following variables are pertinent: raceId, year, circuitId, & name (of the circuit). There was also information about day and time of the race, qualifying rounds and sprint races, however, none of this data was used in my analysis.
  1. results.csv : This dataset summarizes pertinent information regarding the results of each Formula One race. Within the dataset, we can see where each driver started, where they finished and their fastest lap time. Some of the pertinent variables are: raceId, driverId, constructorId, grid (position when starting), position (position when finished), points, laps (the number of races in said race), time in milliseconds, fastestLap, fastestLapTime, fastestLapSpeed, & status of race.
  1. seasons.csv : This dataset provides a quick Wikipedia link for pertinent information on each Formula One Season.
  1. sprint_results.csv : Sprint races are new to Formula One. They are not offered at every race but typically only at 'big' races. Only 1/3 the time of a normal race, the drivers compete in a 'mock race' to earn more points for their team and learn the track. This dataset captures important timing and placing metrics for the recent sprint races.
  1. status.csv : This is a foreign key table that provides text explanations for the different race statuses.

Screen Shot 2022-12-13 at 3.48.16 PM.png

In [4]:
# reading each of the datasets into a dataframe 
circuits = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/circuits.csv")
constructor_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_results.csv")
constructor_standings= pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructor_standings.csv")
constructors = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/constructors.csv")
driver_standings = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/driver_standings.csv")
drivers = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/drivers.csv")
lap_times = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/lap_times.csv")
pit_stops = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/pit_stops.csv")
qualifying = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/qualifying.csv")
races = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/races.csv")
results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/results.csv")
seasons = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/seasons.csv")
sprint_results = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/sprint_results.csv")
status = pd.read_csv("/content/drive/MyDrive/TU/SEMESTERS/f2022/data_final_proj/status.csv")

Data Set Citations

  1. I found all of these datasets on Kaggle. The user who posted these datasets cleaned the data and compiled each table from http://ergast.com/mrd/

Screen Shot 2022-12-13 at 3.53.07 PM.png

Within Formula One's 72 year history, there have been over 171 constructors. In order to accuratly make predictions based on the current Formula One Constructors, I scraped the wikipedia pages of the ten constructors who raced in 2022. I was able gather historical information on drivers for these 10 teams. Later in this project, I concactonated the data from these teams and I was able to see driver switches over time. The web scraping I did acted as a foreign key to all the Kaggle data frames, once it was in place, I could down select any of the Kaggle dataframes to only include information on current drivers.

In [5]:
## Red Bull Racing: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Red_Bull_Racing'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

red_bull_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  red_bull_drivers.append(df_t[0])

red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Car', 'Engine', 'Tyres', 'No.', 'Points', 'Position', 'Name'])
red_bull_drivers[3] = red_bull_drivers[3].join(red_bull_drivers[3]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}
)) 
red_bull_drivers[3]['Driver1'] = red_bull_drivers[3]['First1']  + ' ' + red_bull_drivers[3]['Last1'] 
red_bull_drivers[3]['Driver2'] = red_bull_drivers[3]['First2']  + ' ' + red_bull_drivers[3]['Last2'] 
red_bull_drivers[3]['Driver3'] = red_bull_drivers[3]['First3']  + ' ' + red_bull_drivers[3]['Last3'] 

red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['Drivers', 'First1', 'Last1', 'First2', 'Last2', 'First3', 'Last3'])
red_bull_drivers[3].drop([18], axis=0, inplace=True)
red_bull_drivers[3] = pd.melt(red_bull_drivers[3], id_vars=['Year'])
red_bull_drivers[3] = red_bull_drivers[3].dropna()
red_bull_drivers[3]['Constructor'] = 'Red Bull'
red_bull_drivers[3] = red_bull_drivers[3].rename(columns={"value": "Driver"})
red_bull_drivers[3] = red_bull_drivers[3].drop(columns = ['variable'])


red_bull_drivers[3]['constructorId'] = '9'


red_bull_drivers[3]
Out[5]:
Year Driver Constructor constructorId
0 2005 David Coulthard Red Bull 9
1 2006 David Coulthard Red Bull 9
2 2007 David Coulthard Red Bull 9
3 2008 David Coulthard Red Bull 9
4 2009 Mark Webber Red Bull 9
5 2010 Sebastian Vettel Red Bull 9
6 2011 Sebastian Vettel Red Bull 9
7 2012 Sebastian Vettel Red Bull 9
8 2013 Sebastian Vettel Red Bull 9
9 2014 Sebastian Vettel Red Bull 9
10 2015 Daniel Ricciardo Red Bull 9
11 2016 Daniel Ricciardo Red Bull 9
12 2017 Daniel Ricciardo Red Bull 9
13 2018 Daniel Ricciardo Red Bull 9
14 2019 Pierre Gasly Red Bull 9
15 2020 Alex Albon Red Bull 9
16 2021 Sergio Pérez Red Bull 9
17 2022 Max Verstappen Red Bull 9
18 2005 Christian Klien Red Bull 9
19 2006 Christian Klien Red Bull 9
20 2007 Mark Webber Red Bull 9
21 2008 Mark Webber Red Bull 9
22 2009 Sebastian Vettel Red Bull 9
23 2010 Mark Webber Red Bull 9
24 2011 Mark Webber Red Bull 9
25 2012 Mark Webber Red Bull 9
26 2013 Mark Webber Red Bull 9
27 2014 Daniel Ricciardo Red Bull 9
28 2015 Daniil Kvyat Red Bull 9
29 2016 Daniil Kvyat Red Bull 9
30 2017 Max Verstappen Red Bull 9
31 2018 Max Verstappen Red Bull 9
32 2019 Alex Albon Red Bull 9
33 2020 Max Verstappen Red Bull 9
34 2021 Max Verstappen Red Bull 9
35 2022 Sergio Pérez Red Bull 9
36 2005 Vitantonio Liuzzi Red Bull 9
37 2006 Robert Doornbos Red Bull 9
47 2016 Max Verstappen Red Bull 9
50 2019 Max Verstappen Red Bull 9
In [6]:
## Merecedes: Drivers by Year
url = 'https://en.wikipedia.org/wiki/Mercedes-Benz_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

mercedes_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  mercedes_drivers.append(df_t[0])

mercedes_drivers[5].drop([2], axis =0, inplace =True)
mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Drivers']]

# Please note that Mercredes raced drivers in 1954 and 1955 before taking a hiatus uintil 2010. 
# For the purposes of my analysis on the current Mercedes team, I did not include 1954 - 1955 as the rules were different and there were many more drivers
mercedes_drivers[5].drop([0], axis =0, inplace =True)
mercedes_drivers[5].drop([1], axis =0, inplace =True)
mercedes_drivers[5].drop([16], axis =0, inplace =True)

mercedes_drivers[5] = mercedes_drivers[5].join(mercedes_drivers[5]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}))

mercedes_drivers[5]['Driver1'] = mercedes_drivers[5]['First1']  + ' ' + mercedes_drivers[5]['Last1'] 
mercedes_drivers[5]['Driver2'] = mercedes_drivers[5]['First2']  + ' ' + mercedes_drivers[5]['Last2'] 
mercedes_drivers[5]['Driver3'] = mercedes_drivers[5]['First3']  + ' ' + mercedes_drivers[5]['Last3'] 

mercedes_drivers[5] = mercedes_drivers[5][['Year', 'Driver1', 'Driver2', 'Driver3']]
mercedes_drivers[5] = pd.melt(mercedes_drivers[5], id_vars=['Year'])
mercedes_drivers[5] = mercedes_drivers[5].dropna()

mercedes_drivers[5]['Constructor'] = 'Mercedes'
mercedes_drivers[5] = mercedes_drivers[5].rename(columns={"value": "Driver"})
mercedes_drivers[5] = mercedes_drivers[5].drop(columns= ['variable'])
mercedes_drivers[5]['constructorId'] = '131'


mercedes_drivers[5]
Out[6]:
Year Driver Constructor constructorId
0 2010 Michael Schumacher Mercedes 131
1 2011 Michael Schumacher Mercedes 131
2 2012 Michael Schumacher Mercedes 131
3 2013 Nico Rosberg Mercedes 131
4 2014 Nico Rosberg Mercedes 131
5 2015 Nico Rosberg Mercedes 131
6 2016 Nico Rosberg Mercedes 131
7 2017 Lewis Hamilton Mercedes 131
8 2018 Lewis Hamilton Mercedes 131
9 2019 Lewis Hamilton Mercedes 131
10 2020 Lewis Hamilton Mercedes 131
11 2021 Lewis Hamilton Mercedes 131
12 2022 Lewis Hamilton Mercedes 131
13 2010 Nico Rosberg Mercedes 131
14 2011 Nico Rosberg Mercedes 131
15 2012 Nico Rosberg Mercedes 131
16 2013 Lewis Hamilton Mercedes 131
17 2014 Lewis Hamilton Mercedes 131
18 2015 Lewis Hamilton Mercedes 131
19 2016 Lewis Hamilton Mercedes 131
20 2017 Valtteri Bottas Mercedes 131
21 2018 Valtteri Bottas Mercedes 131
22 2019 Valtteri Bottas Mercedes 131
23 2020 George Russell Mercedes 131
24 2021 Valtteri Bottas Mercedes 131
25 2022 George Russell Mercedes 131
36 2020 Valtteri Bottas Mercedes 131
In [7]:
#Ferrari
url = 'https://en.wikipedia.org/wiki/Ferrari_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

ferrari_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  ferrari_drivers.append(df_t[0])

ferrari_drivers= pd.concat([ferrari_drivers[6], ferrari_drivers[7]], ignore_index=True)

ferrari_drivers = ferrari_drivers[['Year', 'Driver']]
ferrari_drivers = ferrari_drivers.dropna()
ferrari_drivers.drop([30], axis =0, inplace =True)
ferrari_drivers.drop([40], axis =0, inplace =True)


# Adding constructor name 
ferrari_drivers['Constructor'] = 'Ferrari'
ferrari_drivers['constructorId'] = '6'


ferrari_drivers
Out[7]:
Year Driver Constructor constructorId
1 2010 Fernando Alonso Ferrari 6
2 2010 Felipe Massa Ferrari 6
4 2011 Fernando Alonso Ferrari 6
5 2011 Felipe Massa Ferrari 6
7 2012 Fernando Alonso Ferrari 6
8 2012 Felipe Massa Ferrari 6
10 2013 Fernando Alonso Ferrari 6
11 2013 Felipe Massa Ferrari 6
13 2014 Fernando Alonso Ferrari 6
14 2014 Kimi Räikkönen Ferrari 6
16 2015 Kimi Räikkönen Ferrari 6
17 2015 Sebastian Vettel Ferrari 6
19 2016 Kimi Räikkönen Ferrari 6
20 2016 Sebastian Vettel Ferrari 6
22 2017 Kimi Räikkönen Ferrari 6
23 2017 Sebastian Vettel Ferrari 6
25 2018 Kimi Räikkönen Ferrari 6
26 2018 Sebastian Vettel Ferrari 6
28 2019 Charles Leclerc Ferrari 6
29 2019 Sebastian Vettel Ferrari 6
32 2020 Charles Leclerc Ferrari 6
33 2020 Sebastian Vettel Ferrari 6
35 2021 Charles Leclerc Ferrari 6
36 2021 Carlos Sainz Jr. Ferrari 6
38 2022 Charles Leclerc Ferrari 6
39 2022 Carlos Sainz Jr. Ferrari 6
In [8]:
# Mclaren
url = 'https://en.wikipedia.org/wiki/McLaren_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

mclaren_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  mclaren_drivers.append(df_t[0])

mclaren_drivers = pd.concat([mclaren_drivers[5], mclaren_drivers[6]], ignore_index=True)

mclaren_drivers = mclaren_drivers[['Year', 'Drivers']]
mclaren_drivers = mclaren_drivers.dropna()
mclaren_drivers.drop([33], axis =0, inplace =True)
mclaren_drivers.drop([43], axis =0, inplace =True)


# Adding constructor name 
mclaren_drivers['Constructor'] = 'McLaren'
mclaren_drivers = mclaren_drivers.rename(columns={"Drivers": "Driver"})
mclaren_drivers['constructorId'] = '1'


mclaren_drivers
Out[8]:
Year Driver Constructor constructorId
1 2010 Jenson Button McLaren 1
2 2010 Lewis Hamilton McLaren 1
4 2011 Jenson Button McLaren 1
5 2011 Lewis Hamilton McLaren 1
7 2012 Jenson Button McLaren 1
8 2012 Lewis Hamilton McLaren 1
10 2013 Jenson Button McLaren 1
11 2013 Sergio Pérez McLaren 1
13 2014 Jenson Button McLaren 1
14 2014 Kevin Magnussen McLaren 1
16 2015 Jenson Button McLaren 1
17 2015 Kevin Magnussen McLaren 1
18 2015 Fernando Alonso McLaren 1
20 2016 Fernando Alonso McLaren 1
21 2016 Stoffel Vandoorne McLaren 1
22 2016 Jenson Button McLaren 1
24 2017 Fernando Alonso McLaren 1
25 2017 Jenson Button McLaren 1
26 2017 Stoffel Vandoorne McLaren 1
28 2018 Fernando Alonso McLaren 1
29 2018 Stoffel Vandoorne McLaren 1
31 2019 Lando Norris McLaren 1
32 2019 Carlos Sainz Jr. McLaren 1
35 2020 Lando Norris McLaren 1
36 2020 Carlos Sainz Jr. McLaren 1
38 2021 Lando Norris McLaren 1
39 2021 Daniel Ricciardo McLaren 1
41 2022 Lando Norris McLaren 1
42 2022 Daniel Ricciardo McLaren 1
In [9]:
# BWT Alpine
url = 'https://en.wikipedia.org/wiki/Alpine_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alpine_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alpine_drivers.append(df_t[0])

alpine_drivers[1] = alpine_drivers[1][['Year', 'Drivers']]
alpine_drivers[1] = alpine_drivers[1].dropna()
alpine_drivers[1].drop([6], axis =0, inplace =True)

alpine_drivers[1]['Constructor'] = 'Alpine F1 Team'
alpine_drivers[1]['constructorId'] = '214'
alpine_drivers[1] = alpine_drivers[1].rename(columns={"Drivers": "Driver"})

alpine_drivers[1]
Out[9]:
Year Driver Constructor constructorId
1 2021 Fernando Alonso Alpine F1 Team 214
2 2021 Esteban Ocon Alpine F1 Team 214
4 2022 Fernando Alonso Alpine F1 Team 214
5 2022 Esteban Ocon Alpine F1 Team 214
In [10]:
# Alpha Tauri 
url = 'https://en.wikipedia.org/wiki/Scuderia_AlphaTauri'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alpha_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alpha_drivers.append(df_t[0])

alpha_drivers[1] = alpha_drivers[1][['Year', 'Drivers']]
alpha_drivers[1] = alpha_drivers[1].dropna()
alpha_drivers[1].drop([9], axis =0, inplace =True)

alpha_drivers[1]['Constructor'] = 'AlphaTauri'
alpha_drivers[1]['constructorId'] = '213'
alpha_drivers[1] = alpha_drivers[1].rename(columns={"Drivers": "Driver"})

alpha_drivers[1]
Out[10]:
Year Driver Constructor constructorId
1 2020 Pierre Gasly AlphaTauri 213
2 2020 Daniil Kvyat AlphaTauri 213
4 2021 Pierre Gasly AlphaTauri 213
5 2021 Yuki Tsunoda AlphaTauri 213
7 2022 Pierre Gasly AlphaTauri 213
8 2022 Yuki Tsunoda AlphaTauri 213
In [11]:
# Aston Martin 
# Please note that Aston Martin also raced in 1959 - 1960, but I will not be including them in this analysis 
url = 'https://en.wikipedia.org/wiki/Aston_Martin_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

aston_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  aston_drivers.append(df_t[0])

aston_drivers[3] = aston_drivers[3][['Year', 'Driver']]
aston_drivers[3] = aston_drivers[3].dropna()
aston_drivers[3].drop([7], axis =0, inplace =True)

aston_drivers[3]['Constructor'] = 'Aston Martin'
aston_drivers[3]['constructorId'] = '117'

aston_drivers[3]
Out[11]:
Year Driver Constructor constructorId
1 2021 Lance Stroll Aston Martin 117
2 2021 Sebastian Vettel Aston Martin 117
4 2022 Lance Stroll Aston Martin 117
5 2022 Sebastian Vettel Aston Martin 117
6 2022 Nico Hülkenberg Aston Martin 117
In [12]:
# Alfa Romeo 
# Please note Alfa Romeo has raced in 1950-1951, 1979-1985, and 2019-2022 but for this project I will only include the third set of dates
url = 'https://en.wikipedia.org/wiki/Alfa_Romeo_in_Formula_One'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

alfa_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  alfa_drivers.append(df_t[0])

alfa_drivers[2] = alfa_drivers[2][['Year', 'Drivers']]
alfa_drivers[2] = alfa_drivers[2].dropna()
alfa_drivers[2].drop([2], axis =0, inplace =True)
alfa_drivers[2].drop([10], axis =0, inplace =True)
alfa_drivers[2].drop([15], axis =0, inplace =True)
alfa_drivers[2].drop(alfa_drivers[2].loc[0:9].index, axis =0, inplace =True)

alfa_drivers[2] = alfa_drivers[2].join(alfa_drivers[2]['Drivers'].str.split( expand =True).rename(
    columns = {0: 'First1', 1: 'Last1', 2:'First2', 3:'Last2', 4:'First3', 5:'Last3'}))

alfa_drivers[2]['Driver1'] = alfa_drivers[2]['First1']  + ' ' + alfa_drivers[2]['Last1'] 
alfa_drivers[2]['Driver2'] = alfa_drivers[2]['First2']  + ' ' + alfa_drivers[2]['Last2'] 
alfa_drivers[2]['Driver3'] = alfa_drivers[2]['First3']  + ' ' + alfa_drivers[2]['Last3'] 

alfa_drivers[2] = alfa_drivers[2][['Year', 'Driver1', 'Driver2', 'Driver3']]

alfa_drivers[2] = pd.melt(alfa_drivers[2], id_vars=['Year'])
alfa_drivers[2] = alfa_drivers[2].dropna()

alfa_drivers[2]['Constructor'] = 'Alfa Romeo'
alfa_drivers[2]= alfa_drivers[2].drop(columns= ['variable'])
alfa_drivers[2] = alfa_drivers[2].rename(columns={"value": "Driver"})
alfa_drivers[2]['constructorId'] = '51'



alfa_drivers[2]
Out[12]:
Year Driver Constructor constructorId
0 2019 Kimi Räikkönen Alfa Romeo 51
1 2020 Kimi Räikkönen Alfa Romeo 51
2 2021 Kimi Räikkönen Alfa Romeo 51
3 2022 Zhou Guanyu Alfa Romeo 51
4 2019 Antonio Giovinazzi Alfa Romeo 51
5 2020 Antonio Giovinazzi Alfa Romeo 51
6 2021 Robert Kubica Alfa Romeo 51
7 2022 Valtteri Bottas Alfa Romeo 51
10 2021 Antonio Giovinazzi Alfa Romeo 51
In [13]:
# Haas
url = 'https://en.wikipedia.org/wiki/Haas_F1_Team'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

haas_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  haas_drivers.append(df_t[0])

haas_drivers[1] = haas_drivers[1][['Year', 'Drivers']]
haas_drivers[1] = haas_drivers[1].dropna()
haas_drivers[1].drop([22], axis =0, inplace =True)

haas_drivers[1]['Constructor'] = 'Haas F1 Team'
haas_drivers[1]['constructorId'] = '210'
haas_drivers[1] = haas_drivers[1].rename(columns={"Drivers": "Driver"})


haas_drivers[1]
Out[13]:
Year Driver Constructor constructorId
1 2016 Romain Grosjean Haas F1 Team 210
2 2016 Esteban Gutiérrez Haas F1 Team 210
4 2017 Romain Grosjean Haas F1 Team 210
5 2017 Kevin Magnussen Haas F1 Team 210
7 2018 Romain Grosjean Haas F1 Team 210
8 2018 Kevin Magnussen Haas F1 Team 210
10 2019 Romain Grosjean Haas F1 Team 210
11 2019 Kevin Magnussen Haas F1 Team 210
13 2020 Romain Grosjean Haas F1 Team 210
14 2020 Pietro Fittipaldi Haas F1 Team 210
15 2020 Kevin Magnussen Haas F1 Team 210
17 2021 Nikita Mazepin[c] Haas F1 Team 210
18 2021 Mick Schumacher Haas F1 Team 210
20 2022 Kevin Magnussen Haas F1 Team 210
21 2022 Mick Schumacher Haas F1 Team 210
In [14]:
# Williams Racing 

url = 'https://en.wikipedia.org/wiki/Williams_Grand_Prix_results'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
# importing the ability to use beautiful soup 
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )

williams_drivers = []
for t in soup.findAll("table"):
  df_t = pd.read_html(str(t))
  williams_drivers.append(df_t[0])

williams_drivers = pd.concat([williams_drivers[4], williams_drivers[5]], ignore_index=True)

williams_drivers = williams_drivers[['Year', 'Drivers']]
williams_drivers = williams_drivers.dropna()
williams_drivers.drop([31], axis =0, inplace =True)
williams_drivers.drop([43], axis =0, inplace =True)


# Adding constructor name 
williams_drivers['Constructor'] = 'Williams'
williams_drivers = williams_drivers.rename(columns={"Drivers": "Driver"})
williams_drivers['constructorId'] = '3'

williams_drivers
Out[14]:
Year Driver Constructor constructorId
1 2010 Rubens Barrichello Williams 3
2 2010 Nico Hülkenberg Williams 3
4 2011 Rubens Barrichello Williams 3
5 2011 Pastor Maldonado Williams 3
7 2012 Pastor Maldonado Williams 3
8 2012 Bruno Senna Williams 3
10 2013 Pastor Maldonado Williams 3
11 2013 Valtteri Bottas Williams 3
13 2014 Felipe Massa Williams 3
14 2014 Valtteri Bottas Williams 3
16 2015 Felipe Massa Williams 3
17 2015 Valtteri Bottas Williams 3
19 2016 Felipe Massa Williams 3
20 2016 Valtteri Bottas Williams 3
22 2017 Lance Stroll Williams 3
23 2017 Felipe Massa Williams 3
24 2017 Paul di Resta Williams 3
26 2018 Lance Stroll Williams 3
27 2018 Sergey Sirotkin Williams 3
29 2019 George Russell Williams 3
30 2019 Robert Kubica Williams 3
33 2020 Nicholas Latifi Williams 3
34 2020 George Russell Williams 3
35 2020 Jack Aitken Williams 3
37 2021 Nicholas Latifi Williams 3
38 2021 George Russell Williams 3
40 2022 Alexander Albon Williams 3
41 2022 Nicholas Latifi Williams 3
42 2022 Nyck de Vries Williams 3
In [15]:
# COMBINING EACH OF THE DATAFRAMES FROM THE WEB SCRAPING INTO A LARGE DATAFRAME
current_drivers = pd.concat([haas_drivers[1], alfa_drivers[2],aston_drivers[3], alpha_drivers[1], alpine_drivers[1]
, mercedes_drivers[5], red_bull_drivers[3], mclaren_drivers, williams_drivers, ferrari_drivers], ignore_index=True)
current_drivers
# THIS DATAFRAME SHOWS THAT THE DRIVERS, THEIR TEAMS, AND THE YEAR FOR THE TEN CURRENT F1 TEAMS
Out[15]:
Year Driver Constructor constructorId
0 2016 Romain Grosjean Haas F1 Team 210
1 2016 Esteban Gutiérrez Haas F1 Team 210
2 2017 Romain Grosjean Haas F1 Team 210
3 2017 Kevin Magnussen Haas F1 Team 210
4 2018 Romain Grosjean Haas F1 Team 210
... ... ... ... ...
185 2020 Sebastian Vettel Ferrari 6
186 2021 Charles Leclerc Ferrari 6
187 2021 Carlos Sainz Jr. Ferrari 6
188 2022 Charles Leclerc Ferrari 6
189 2022 Carlos Sainz Jr. Ferrari 6

190 rows × 4 columns

EXTRACTION, TRANSFORMATION & LOADING

Screen Shot 2022-12-13 at 4.27.36 PM.png

Cleaning the lap_times dataframe & displaying dtypes for incorporation into the regression model later in this project

In [16]:
# RENAMING COLUMNS TO BE UNDERSTANDABLE
lap_times = lap_times.rename(columns={"time": "lap_time", "milliseconds": "lap_in_milli", "position" : "position_after_lap"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
lap_times = lap_times.drop(columns=['lap_time'])
lap_times
Out[16]:
raceId driverId lap position_after_lap lap_in_milli
0 841 20 1 1 98109
1 841 20 2 1 93006
2 841 20 3 1 92713
3 841 20 4 1 92803
4 841 20 5 1 92342
... ... ... ... ... ...
538116 1096 822 53 16 92998
538117 1096 822 54 16 92995
538118 1096 822 55 16 91236
538119 1096 822 56 15 90566
538120 1096 822 57 15 90743

538121 rows × 5 columns

In [17]:
# ALL OF THE VARIABLES IN LAP_TIMES SHOULD BE QUANTITATIVE. IDS, POSITIONS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
lap_times.dtypes
Out[17]:
raceId                int64
driverId              int64
lap                   int64
position_after_lap    int64
lap_in_milli          int64
dtype: object

Screen Shot 2022-12-13 at 4.31.41 PM.png

In [18]:
# RENAMING COLUMNS TO BE UNDERSTANDABLE
pit_stops = pit_stops.rename(columns={"time" : "time_when_stopped", "milliseconds": "stop_in_milli", "stop" : "stop_#"})
# WE WILL ONLY BE USING MILLISECONDS, DROPPING LAP TIME IN MINUTES AND SECONDS
pit_stops = pit_stops.drop(columns=['duration'])
pit_stops
Out[18]:
raceId driverId stop_# lap time_when_stopped stop_in_milli
0 841 153 1 1 17:05:23 26898
1 841 30 1 1 17:05:52 25021
2 841 17 1 11 17:20:48 23426
3 841 4 1 12 17:22:34 23251
4 841 13 1 13 17:24:10 23842
... ... ... ... ... ... ...
9629 1096 849 2 38 18:02:50 25174
9630 1096 840 2 40 18:04:44 21802
9631 1096 839 2 41 18:06:09 21734
9632 1096 846 2 42 18:07:36 21559
9633 1096 855 2 44 18:11:10 23159

9634 rows × 6 columns

In [19]:
# ALL OF THE VARIABLES IN PIT_STOPS SHOULD BE QUANTITATIVE. IDS, STOP NUMBERS, AND TIMES SHOULD BE INTEGERS THAT WE CAN DO CALCULATIONS WITH
pit_stops.dtypes
Out[19]:
raceId                int64
driverId              int64
stop_#                int64
lap                   int64
time_when_stopped    object
stop_in_milli         int64
dtype: object

Screen Shot 2022-12-13 at 9.53.47 PM.png

How do qualifying rounds work in Formula One?

There are three qualifying rounds for each race weekend in Formula One. These rounds take place typically over two days prior to the race on Sunday. In the first round, all twenty Formula One drivers participate. The driver's race one lap individually and attempt to recieve the fastest time. Those in the top fifteen will move on to the second round of qualifiers. Those with the bottom five times, will start race day in the last five positions on the grid. In the second round of qualifiers, drivers drive one lap, once again aiming for the faastest time. Those with the fastest ten times will move on to the third round of qualifiers. Those who had the bottom five race times in Q2 will be placed in spots 10-15 on the grid for Sunday's race. The final qualifying round determines the order of the drivers who will be in positions 1-10 on race day.

How should this information shape our data?

Since not all drivers will participate in all three rounds of qualifying, I am going to create dummy variables that will indicate (based on their success in qualifiers) whether a driver will race in the top ten, middle five, or bottom five.

In [20]:
# CLEANING THE QUALIFYING DATASET, DROPPING UNNEEDED VARIABLES
qualifying = qualifying.drop(columns = ["number"])
# REPLACING NAN WITH 0, THIS MEANS THAT THEY DID NOT RACE IN THAT ROUND, NOTE EVERYONE SHOULD HAVE A TIME FOR Q1 BUT NOT EVERYONE WILL MAKE IT TO Q3
qualifying = qualifying.replace('\\N','0')
# DUMMY TO INDICATE THAT THEY WERE PLACED IN TOP TEN STARTING POSITIONS (1-10)
qualifying['top_ten'] = np.where(qualifying['q3'] != '0' , 1, 0)
# DUMMY TO INDICATE THAT THEY WERE IN THE MIDDLE FIVE STARTING POSITIONS (11-15)
qualifying['middle'] = np.where((qualifying['q3'] == '0') & (qualifying['q2'] >'0'), 1, 0)
# DUMMY TO INDICARE THAT THEY WERE IN THE BOTTOM FIVE STARTING POSITIONS (16-20)
qualifying['bottom_five'] = np.where((qualifying['q3']=='0') & (qualifying['q2']=='0'), 1, 0)
# RENAMING THE COLUMN SO WE KNOW WHERE THEY STARTED
qualifying = qualifying.rename(columns= {"position" : "starting_position"})
# MERGING IN INFOMATION ABOUT WHAT RACE
qualifying = qualifying.merge(races, how='left', on="raceId")
qualifying = qualifying[['qualifyId', 'raceId', 'driverId','constructorId', 'starting_position', 'top_ten', 'middle', 'bottom_five', 'circuitId', 'year', 'name']]
qualifying
Out[20]:
qualifyId raceId driverId constructorId starting_position top_ten middle bottom_five circuitId year name
0 1 18 1 1 1 1 0 0 1 2008 Australian Grand Prix
1 2 18 9 2 2 1 0 0 1 2008 Australian Grand Prix
2 3 18 5 1 3 1 0 0 1 2008 Australian Grand Prix
3 4 18 13 6 4 1 0 0 1 2008 Australian Grand Prix
4 5 18 2 2 5 1 0 0 1 2008 Australian Grand Prix
... ... ... ... ... ... ... ... ... ... ... ...
9570 9628 1096 825 210 16 0 0 1 24 2022 Abu Dhabi Grand Prix
9571 9629 1096 842 213 17 0 0 1 24 2022 Abu Dhabi Grand Prix
9572 9630 1096 822 51 18 0 0 1 24 2022 Abu Dhabi Grand Prix
9573 9631 1096 848 3 19 0 0 1 24 2022 Abu Dhabi Grand Prix
9574 9632 1096 849 3 20 0 0 1 24 2022 Abu Dhabi Grand Prix

9575 rows × 11 columns

In [21]:
qualifying.dtypes
# ALL OF THE VARIABLES IN THE QUALIFYING DATAFRAME SHOULD BE INTEGERS AS THEY EITHER REPRESENT IDS, POSITIONS, OR DUMMY VARIABLES. 
# THE ONLY EXCEPTION TO THIS RULE IS THE NAME OF THE CIRCUIT
Out[21]:
qualifyId             int64
raceId                int64
driverId              int64
constructorId         int64
starting_position     int64
top_ten               int64
middle                int64
bottom_five           int64
circuitId             int64
year                  int64
name                 object
dtype: object

Screen Shot 2022-12-13 at 10.23.51 PM.png

In [22]:
# DROPING COLUMNS I DON'T NEED FROM THE ORIGINAL RESULTS COLUMN
results2 = results.drop(columns=['resultId', 'number', 'positionText','positionOrder', 'time'])
# RENAMING THE COLUMNS FOR UNDERSTANDABILITY
results2 = results2.rename(columns= {"grid" : "starting_position", "position":"finishing_position", "rank" : "overall_standing", "milliseconds":"time_milliseconds"})
# FILLING IN LAP TIMES THAT HAD NO DATA (INDICATING THAT THE DRIVER DID NOT RACE OR DID NOT FINISH THE RACE, WITH ZEROS)
results2 = results2.replace('\\N','0')
# CHANGING THE DTYPES TO BE QUANTITATIVE IF A NUMBER AND A INTERGER IF IT REPRESENTS A CATEGORY 
results2['time_milliseconds'] = results2['time_milliseconds'].astype(int)
results2['finishing_position'] = results2['finishing_position'].astype(int)
results2['overall_standing'] = results2['overall_standing'].astype(int)
results2['statusId'] = results2['statusId'].astype(str)
results2['fastestLap'] = results2['fastestLap'].astype(int)
results2['fastestLapSpeed'] = results2['fastestLapSpeed'].astype(float)
# ADDING A DUMMY TO INDICATE THAT THE DRIVER FINISHED THE RACE
results2['finished'] = np.where(results2['statusId']=='1', 1,0)
# ADDING DUMMIES THAT INDICATE WHERE THE DRIVER PLACED
results2['podium'] = np.where(results2['finishing_position']<=3, 1,0)
results2['top_ten'] = np.where((results2['finishing_position']>3) & (results2['finishing_position']<=10), 1,0)
results2['bottom_ten'] = np.where(results2['finishing_position']>10, 1,0)
results2 = results2.drop(columns=['fastestLapTime'])
results2
Out[22]:
raceId driverId constructorId starting_position finishing_position points laps time_milliseconds fastestLap overall_standing fastestLapSpeed statusId finished podium top_ten bottom_ten
0 18 1 1 1 1 10.0 58 5690616 39 2 218.300 1 1 1 0 0
1 18 2 2 5 2 8.0 58 5696094 41 3 217.586 1 1 1 0 0
2 18 3 3 7 3 6.0 58 5698779 41 5 216.719 1 1 1 0 0
3 18 4 4 11 4 5.0 58 5707797 58 7 215.464 1 1 0 1 0
4 18 5 1 3 5 4.0 58 5708630 43 1 218.385 1 1 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25835 1096 854 210 12 16 0.0 57 0 39 12 211.632 11 0 0 0 1
25836 1096 825 210 16 17 0.0 57 0 40 20 208.556 11 0 0 0 1
25837 1096 1 131 5 18 0.0 55 0 42 11 211.738 6 0 0 0 1
25838 1096 849 3 20 19 0.0 55 0 45 14 210.517 130 0 0 0 1
25839 1096 4 214 10 0 0.0 27 0 24 17 209.889 47 0 1 0 0

25840 rows × 16 columns

In [23]:
results2.dtypes
# ALL OF THE VARIABLES IN THE REFINED RESULTS DATAFRAME SHOULD BE NUMERICAL AS THEY REPRESENT TIMES, IDS, AND DUMMY VARIABLES 
# WE ARE KEEPING STATUSID AS AN OBJECT BECUASE THE NUMBERS REFER TO SPECIFIC CATEGORIES REGARDING HOW A DRIVER FINISHED OR WHAT MAY HAVE CAUSED THEM TO NOT COMPLETE A RACE
Out[23]:
raceId                  int64
driverId                int64
constructorId           int64
starting_position       int64
finishing_position      int64
points                float64
laps                    int64
time_milliseconds       int64
fastestLap              int64
overall_standing        int64
fastestLapSpeed       float64
statusId               object
finished                int64
podium                  int64
top_ten                 int64
bottom_ten              int64
dtype: object

Screen Shot 2022-12-13 at 10.30.06 PM.png

  1. Concat web scrapes & limit drivers: The first step towards my regression model is limiting the data to only include information on drivers who have raced for the 10 current Formula one teams. As mentioned in the data section, I scraped the Wikipedia pages of each team for this information. Now, I am going to create a list of unique drivers that will act as a foreign key to any dataframe I use going for
In [24]:
# LET'S CREATE A LIST OF THE DRIVERS WHO HAVE DRIVEN FOR THE 10 CURRENT TEAMS
names_of_drivers = current_drivers['Driver'].unique()
names_of_drivers = pd.DataFrame(data = names_of_drivers, columns = ['Driver'])
# THERE WERE A HANDFUL OF SPELLING MISTAKES IN DRIVER NAME BETWEEN THE SCRAPES AND KAGGLE
# I HAVE MANUALLY ADJUSTED THESE SO THAT MERGING WOULD BE DONE EASILY
names_of_drivers = names_of_drivers.replace("Nikita Mazepin[c]", "Nikita Mazepin")
names_of_drivers = names_of_drivers.replace("Carlos Sainz Jr.", "Carlos Sainz")
names_of_drivers = names_of_drivers.replace("Zhou Guanyu", "Guanyu Zhou")
names_of_drivers = names_of_drivers.replace("Alex Albon", "Alexander Albon")
# NOW LETS MERGE IN THE DRIVER IDS 
# FIRST I HAVE TO CREATE A DATAFRAME THE CONTAINS THE NAME OF THE DRIVERS AND THEIR FULL NAMES USING THE KAGGLE DATASETS
# THEN I WILL USE THIS DATAFRAME AND MERGE IT WITH OUR UNIQUE NAMES TO CREATE A FOREIGN KEY 
drivers_names  = pd.DataFrame(data=drivers[['driverId', 'forename', 'surname']])
drivers_names['Driver'] = drivers_names['forename'] + ' ' + drivers_names['surname']
drivers_names = drivers_names[['driverId', 'Driver']]

names_of_drivers = names_of_drivers.merge(drivers_names, how='left')
names_of_drivers =names_of_drivers.dropna()
names_of_drivers.head()
# THERE ARE 46 UNIQUE DRIVERS WHO HAVE DRIVEN OR ARE CURRENTLY DRIVING FOR THE 10 CURRENT F1 TEAMS
Out[24]:
Driver driverId
0 Romain Grosjean 154
1 Esteban Gutiérrez 821
2 Kevin Magnussen 825
3 Pietro Fittipaldi 850
4 Nikita Mazepin 853
  1. Build input dataframe for regression: With the unique list of drivers, we can now begin to modify the results data frame to contain information that will be needed for our model. I will merge in information about individual lap times, pitstops, lap speeds, and race level information on starting and finishing position.
In [25]:
# Downselecting the results dataframe for recent drivers only 
results_recent = results2[['driverId', 'raceId', 'starting_position', 'constructorId', 'finishing_position','fastestLapSpeed']]
# merge in circuit name based on raceID
results_recent = results_recent.merge(races, how='right')
results_recent = results_recent[['driverId', 'raceId', 'circuitId', 'name', 'year', 'constructorId','starting_position', 'finishing_position','fastestLapSpeed' ]]
results_recent = results_recent.dropna()
# I only want information for the list of 41 recent drivers 
results_recent = results_recent.merge(names_of_drivers, how='right')
# I now want information on their lap times in milliseconds
results_recent = results_recent.merge(lap_times, how='left')
# Adding constructor Name 
results_recent= results_recent.merge(constructors[['constructorId', 'constructorRef']], how = 'left', on = 'constructorId')
# limiting to 2010 onward
results_recent = results_recent.loc[results_recent['year'] >=2010]
# ADDING PITSTOP INFORMATION 
results_recent = results_recent.merge(pit_stops, how='left', on = ['driverId', 'lap', 'raceId'])
results_recent = results_recent.drop(columns=['position_after_lap', 'stop_#', 'time_when_stopped'])

results_recent
Out[25]:
driverId raceId circuitId name year constructorId starting_position finishing_position fastestLapSpeed Driver lap lap_in_milli constructorRef stop_in_milli
0 154 860 1 Australian Grand Prix 2012 208 3 0 0.000 Romain Grosjean 1.0 103730.0 lotus_f1 NaN
1 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 1.0 139773.0 lotus_f1 NaN
2 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 2.0 128464.0 lotus_f1 NaN
3 154 861 2 Malaysian Grand Prix 2012 208 6 0 155.333 Romain Grosjean 3.0 132909.0 lotus_f1 NaN
4 154 862 17 Chinese Grand Prix 2012 208 10 6 194.062 Romain Grosjean 1.0 112513.0 lotus_f1 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
243007 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 54.0 90095.0 ferrari NaN
243008 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 55.0 90151.0 ferrari NaN
243009 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 56.0 90033.0 ferrari NaN
243010 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 57.0 89829.0 ferrari NaN
243011 844 1096 24 Abu Dhabi Grand Prix 2022 6 3 2 211.901 Charles Leclerc 58.0 90500.0 ferrari NaN

243012 rows × 14 columns

  1. I will now modify the results dataframe heavily to only contain year to year averages. This was done by merging multiple different kaggle datasets, performing aggregations on specific columns, and deleting unneccesary variables. After duplicates were removed, we were left with year-to-year averages for each of our recents drivers since 2010
In [26]:
results_recent= results_recent[['Driver', 'driverId', 'year', 'constructorRef','raceId', 'circuitId', 'name', 'lap', 'lap_in_milli', 'stop_in_milli', 'fastestLapSpeed', 'starting_position', 'finishing_position']]
#ADDING A COLUMN THAT REPRESENTS THE AVERAGE LAP SPEED FOR A GIVEN YEAR 
avg_lap = results_recent.groupby(['year', 'Driver'], as_index=False)[['lap_in_milli']].mean()
avg_lap = avg_lap.rename(columns= {"lap_in_milli" : "avg_lap"})
# MERGING BACK INTO RESULTS RECENT 
results_recent= results_recent.merge(avg_lap[['year', 'Driver', 'avg_lap']], how = 'left', on = ['year', 'Driver'] )
# ADDING A COLUMN THAT REPRESENT THE AVERAGE PIT STOP FOR A GIVEN YEAR
avg_pit = results_recent.groupby(['year', 'Driver'], as_index=False)[['stop_in_milli']].mean()
avg_pit = avg_pit.rename(columns= {"stop_in_milli" : "avg_pit"})
avg_pit
# MERGING BACK INTO RESULTS
results_recent= results_recent.merge(avg_pit[['year', 'Driver', 'avg_pit']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVG FASTEST LAP SPEED 
avg_fastest_speed = results_recent.groupby(['year', 'Driver'], as_index=False)[['fastestLapSpeed']].mean()
avg_fastest_speed  = avg_fastest_speed.rename(columns= {"fastestLapSpeed" : "avg_fastest_speed"})
results_recent= results_recent.merge(avg_fastest_speed[['year', 'Driver', 'avg_fastest_speed']], how = 'left', on = ['year', 'Driver'] )
# DROPPING THE INDIV LAP SPEEDS AND PITS
results_recent = results_recent.drop(columns=['lap_in_milli', 'stop_in_milli', 'fastestLapSpeed'])
# ADDING IN INFORMATION ON AVERAGE STARTING POSITION 
avg_start = results_recent.groupby(['year', 'Driver'], as_index=False)[['starting_position']].mean()
avg_start = avg_start.rename(columns= {"starting_position" : "avg_start"})
results_recent= results_recent.merge(avg_start[['year', 'Driver', 'avg_start']], how = 'left', on = ['year', 'Driver'] )
# ADDING IN INFORMATION ON AVERAGE FINISHING POSITION
avg_finish = results_recent.groupby(['year', 'Driver'], as_index=False)[['finishing_position']].mean()
avg_finish = avg_finish.rename(columns= {"finishing_position" : "avg_finish"})
results_recent= results_recent.merge(avg_finish[['year', 'Driver', 'avg_finish']], how = 'left', on = ['year', 'Driver'] )
# DROPPING INFORMATION ABOUT STARTING AND FINISHING POSTIOIN FOR EACH RACE 
results_recent = results_recent.drop(columns=['starting_position', 'finishing_position'])
# WORKING TOWARDS THE NUMBER OF WINS PER SEASON 
driver_standings=driver_standings.merge(results_recent[['driverId', 'Driver', 'raceId']], on = ['raceId', 'driverId'] )
results_recent= results_recent.merge(driver_standings[['raceId', 'driverId', 'wins']], how = 'left', on = ['raceId', 'driverId'] )
max_wins =results_recent.groupby(['year', 'Driver'], as_index=False)[['wins']].max()
max_wins  = max_wins.rename(columns= {"wins" : "max_wins"})
results_recent= results_recent.merge(max_wins[['year', 'Driver', 'max_wins']], how = 'left', on = ['year', 'Driver'] )
results_recent = results_recent.drop(columns=['wins'])
results_recent = results_recent.drop(columns=['raceId', 'circuitId', 'name', 'lap'])
# I AM DELETING DUPLICATES BECAUSE I WANT YEAR AVERAGES NOT RACE BY RACE DATA
results_recent= results_recent.drop_duplicates(keep='first')

results_recent
Out[26]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
0 Romain Grosjean 154 2012 lotus_f1 98661.940224 24162.448276 195.716262 7.816149 6.604969 0.0
46311 Romain Grosjean 154 2013 lotus_f1 101362.287643 24448.615385 199.298737 8.629283 6.069574 0.0
100804 Romain Grosjean 154 2014 lotus_f1 104962.790136 51355.487805 192.494907 15.045121 10.006296 0.0
153479 Romain Grosjean 154 2015 lotus_f1 97439.547258 25392.606061 188.249763 10.759022 8.215367 0.0
205274 Romain Grosjean 154 2016 haas 98754.771845 76684.450000 198.282209 14.428295 10.931202 0.0
... ... ... ... ... ... ... ... ... ... ...
14536413 Charles Leclerc 844 2018 sauber 93275.184310 24773.640000 204.704019 12.451887 8.588679 0.0
14598871 Charles Leclerc 844 2019 ferrari 92773.335640 24281.250000 212.168689 3.787197 4.152249 2.0
14666219 Charles Leclerc 844 2020 ferrari 100032.670316 160831.714286 214.948978 8.012151 6.274605 0.0
14712986 Charles Leclerc 844 2021 ferrari 97752.122271 230268.000000 215.072843 6.623365 6.410636 0.0
14716955 Charles Leclerc 844 2022 ferrari 97835.053827 129264.767442 209.083621 4.234651 3.190917 3.0

244 rows × 10 columns

In [27]:
# ADDING INFO ON THE CONSTRUCTOR STANDINGS 
# I WILL USE MAX POINTS FOR A SEASON (THE POINTS THAT A CONSTRUCTOR HAD AT THE END OF THE SEASON) AS A PROXY FOR CONSTRUCTOR STANDINGS
results_recent= results_recent.merge(constructors[['constructorId', 'constructorRef']], how = 'left', on = ['constructorRef'] )
results_recent= results_recent.merge(constructor_standings[['points', 'constructorId']], how = 'left', on = ['constructorId'] )
points_szn = results_recent.groupby(['year', 'constructorId'], as_index=False)[['points']].max()
points_szn  = points_szn.rename(columns= {"points" : "points_szn"})
results_recent= results_recent.merge(points_szn[['year', 'constructorId', 'points_szn']], how = 'left', on = ['year', 'constructorId'])
results_recent = results_recent.drop(columns=['points'])
results_recent= results_recent.drop_duplicates(keep='first')
results_recent
Out[27]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins constructorId points_szn
0 Romain Grosjean 154 2012 lotus_f1 98661.940224 24162.448276 195.716262 7.816149 6.604969 0.0 208 315.0
77 Romain Grosjean 154 2013 lotus_f1 101362.287643 24448.615385 199.298737 8.629283 6.069574 0.0 208 315.0
154 Romain Grosjean 154 2014 lotus_f1 104962.790136 51355.487805 192.494907 15.045121 10.006296 0.0 208 315.0
231 Romain Grosjean 154 2015 lotus_f1 97439.547258 25392.606061 188.249763 10.759022 8.215367 0.0 208 315.0
308 Romain Grosjean 154 2016 haas 98754.771845 76684.450000 198.282209 14.428295 10.931202 0.0 210 93.0
... ... ... ... ... ... ... ... ... ... ... ... ...
107894 Charles Leclerc 844 2018 sauber 93275.184310 24773.640000 204.704019 12.451887 8.588679 0.0 15 126.0
108282 Charles Leclerc 844 2019 ferrari 92773.335640 24281.250000 212.168689 3.787197 4.152249 2.0 6 571.0
109294 Charles Leclerc 844 2020 ferrari 100032.670316 160831.714286 214.948978 8.012151 6.274605 0.0 6 571.0
110306 Charles Leclerc 844 2021 ferrari 97752.122271 230268.000000 215.072843 6.623365 6.410636 0.0 6 571.0
111318 Charles Leclerc 844 2022 ferrari 97835.053827 129264.767442 209.083621 4.234651 3.190917 3.0 6 571.0

244 rows × 12 columns

For reference throughout the analysis, I have created a table that shows which team our recent drivers have raced for since 2010. If there is a NaN in this table, it represents that the driver was not driving this year

In [28]:
# I want to create a graph that will show which drivers have switched teams within this dataset
# Creating a pivot table containing information on year, driver, and the team they were racing for
results_pivot = results_recent.pivot_table(
    index="Driver", columns =['year'], 
    values = "constructorRef", aggfunc=np.max
)
results_pivot
Out[28]:
year 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
Driver
Alexander Albon NaN NaN NaN NaN NaN NaN NaN NaN NaN toro_rosso red_bull NaN williams
Antonio Giovinazzi NaN NaN NaN NaN NaN NaN NaN sauber NaN alfa alfa alfa NaN
Bruno Senna hrt renault williams NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Carlos Sainz NaN NaN NaN NaN NaN toro_rosso toro_rosso toro_rosso renault mclaren mclaren ferrari ferrari
Charles Leclerc NaN NaN NaN NaN NaN NaN NaN NaN sauber ferrari ferrari ferrari ferrari
Christian Klien hrt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Daniel Ricciardo NaN hrt toro_rosso toro_rosso red_bull red_bull red_bull red_bull red_bull renault renault mclaren mclaren
Daniil Kvyat NaN NaN NaN NaN toro_rosso red_bull toro_rosso toro_rosso NaN toro_rosso alphatauri NaN NaN
Esteban Gutiérrez NaN NaN NaN sauber sauber NaN haas NaN NaN NaN NaN NaN NaN
Esteban Ocon NaN NaN NaN NaN NaN NaN manor force_india force_india NaN renault alpine alpine
Felipe Massa ferrari ferrari ferrari ferrari williams williams williams williams NaN NaN NaN NaN NaN
Fernando Alonso ferrari ferrari ferrari ferrari ferrari mclaren mclaren mclaren mclaren NaN NaN alpine alpine
George Russell NaN NaN NaN NaN NaN NaN NaN NaN NaN williams williams williams mercedes
Guanyu Zhou NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN alfa
Jack Aitken NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams NaN NaN
Jenson Button mclaren mclaren mclaren mclaren mclaren mclaren mclaren mclaren NaN NaN NaN NaN NaN
Kevin Magnussen NaN NaN NaN NaN mclaren mclaren renault haas haas haas haas NaN haas
Kimi Räikkönen NaN NaN lotus_f1 lotus_f1 ferrari ferrari ferrari ferrari ferrari alfa alfa alfa NaN
Lance Stroll NaN NaN NaN NaN NaN NaN NaN williams williams racing_point racing_point aston_martin aston_martin
Lando Norris NaN NaN NaN NaN NaN NaN NaN NaN NaN mclaren mclaren mclaren mclaren
Lewis Hamilton mclaren mclaren mclaren mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes mercedes
Mark Webber red_bull red_bull red_bull red_bull NaN NaN NaN NaN NaN NaN NaN NaN NaN
Max Verstappen NaN NaN NaN NaN NaN toro_rosso toro_rosso red_bull red_bull red_bull red_bull red_bull red_bull
Michael Schumacher mercedes mercedes mercedes NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Mick Schumacher NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN haas haas
Nicholas Latifi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams williams williams
Nico Hülkenberg williams NaN force_india sauber force_india force_india force_india renault renault renault racing_point NaN aston_martin
Nico Rosberg mercedes mercedes mercedes mercedes mercedes mercedes mercedes NaN NaN NaN NaN NaN NaN
Nikita Mazepin NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN haas NaN
Nyck de Vries NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN williams
Pastor Maldonado NaN williams williams williams lotus_f1 lotus_f1 NaN NaN NaN NaN NaN NaN NaN
Paul di Resta NaN force_india force_india force_india NaN NaN NaN williams NaN NaN NaN NaN NaN
Pierre Gasly NaN NaN NaN NaN NaN NaN NaN toro_rosso toro_rosso toro_rosso alphatauri alphatauri alphatauri
Pietro Fittipaldi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN haas NaN NaN
Robert Kubica renault NaN NaN NaN NaN NaN NaN NaN NaN williams NaN alfa NaN
Romain Grosjean NaN NaN lotus_f1 lotus_f1 lotus_f1 lotus_f1 haas haas haas haas haas NaN NaN
Rubens Barrichello williams williams NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Sebastian Vettel red_bull red_bull red_bull red_bull red_bull ferrari ferrari ferrari ferrari ferrari ferrari aston_martin aston_martin
Sergey Sirotkin NaN NaN NaN NaN NaN NaN NaN NaN williams NaN NaN NaN NaN
Sergio Pérez NaN sauber sauber mclaren force_india force_india force_india force_india force_india racing_point racing_point red_bull red_bull
Stoffel Vandoorne NaN NaN NaN NaN NaN NaN mclaren mclaren mclaren NaN NaN NaN NaN
Valtteri Bottas NaN NaN NaN williams williams williams williams mercedes mercedes mercedes mercedes mercedes alfa
Vitantonio Liuzzi force_india hrt NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yuki Tsunoda NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN alphatauri alphatauri
  1. Based on the table above, I picked four drivers and have decided to do a regression discontinuity analysis on five important success metrics: average lap time, average pitstop time, average fastest lap speed, average starting position, and average finishing position. These averages are in relation to the year as a whole. I choose the seven drivers based on their noterity, the amount of data they had and whether they switched teams at least once in their career. In the end, I choose Max Verstappen, Lewis Hamilton, Sebastian Vettel, and Daniel Ricciardo

    First I will create a graph that shows how these four drivers switched teams over the year. This graph is purely for visuals.

In [29]:
results_seven = results_recent.loc[(results_recent['Driver']=="Daniel Ricciardo") |  (results_recent['Driver']== "Lewis Hamilton" )| (results_recent['Driver']== "Max Verstappen") | (results_recent['Driver']== "Sebastian Vettel")]
In [30]:
# Now we are going to create a line plot that shows how these drivers moved over the years 
# To limit how crezy the graph is, I am limiting my regression discontinuity to four of the drivers. The drivers had more data and at least one team switch 
plt.figure(figsize=(40,20))
sns.lineplot(data=results_seven, x="year", y="constructorRef", hue="Driver", style = "Driver", markers = True, estimator=None,linewidth=10).set(title='The Four Drivers and their team history since 2010')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), prop={'size': 25})
Out[30]:
<matplotlib.legend.Legend at 0x7f2e64e32610>
  1. Prove Regression Discontinuity for Constructors: I will now analyze these four drivers and see how their averages were affected pre and post switches

Where did my idea for constructor v. driver come from?

Regression Discontinuity Explanation

Screen Shot 2022-12-15 at 11.37.34 AM.png

In 2015 Max Verstappen was racing for Torro Rosso, who was ranked 7th overall. In 2016, he switched teams and raced for RedBull, who was ranked 4th in 2015. We want to see if Max's average lap time, average pitstop time, fastest Lap speed and other season indicators were affected by switching to a 'better' team.

In [ ]:
max_verstappen = results_recent.loc[(results_recent['Driver'] == 'Max Verstappen')]
max_verstappen
Out[ ]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
10677429 Max Verstappen 830 2015 toro_rosso 97788.703415 25845.048780 194.527915 10.929756 8.470244 0.0
10737834 Max Verstappen 830 2016 toro_rosso 102161.550170 129891.115385 200.090821 5.076531 4.308673 1.0
10748557 Max Verstappen 830 2016 red_bull 102161.550170 129891.115385 200.090821 5.076531 4.308673 1.0
10807006 Max Verstappen 830 2017 red_bull 92399.428571 24077.080000 206.163878 6.506699 3.975639 2.0
10854989 Max Verstappen 830 2018 red_bull 92339.323869 23535.192308 209.936718 6.717835 3.203194 2.0
10922890 Max Verstappen 830 2019 red_bull 91277.877119 23682.968750 208.051921 4.508891 3.413209 3.0
10995909 Max Verstappen 830 2020 red_bull 96558.793711 125957.653846 216.819662 3.111669 2.150565 2.0
11041476 Max Verstappen 830 2021 red_bull 92551.655656 141722.790698 212.078681 2.940594 1.797030 10.0
11045445 Max Verstappen 830 2022 red_bull 98011.432730 120640.468085 208.416640 3.322581 2.694729 15.0
In [ ]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Max Verstappen Yearly Averages (centered at 2016)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2016, color = 'green', linestyle='-.')
max_verstappen.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.33.44 PM.png

Screen Shot 2022-12-15 at 11.39.59 AM.png

In 2014 Sebastian Vettel was racing for Red Bull, who was ranked 2nd overall. In 2015, he switched teams and raced for Ferrari, who was ranked 4th in 2014. We want to see if Sebastian's average lap time, average pitstop time and fastest Lap speed was affected by switching to a 'lesser' team.

Additionally, in 2020 Sebastian was racing for Ferrari, who was ranked 6th overall. In 2021, he switched teams and raced for Aston Martin, who would be new to the grid in 2021. We want to see how this new team affected hisaverage lap time, average pitstop time and fastest Lap speed.

In [ ]:
seb_vettel = results_recent.loc[(results_recent['Driver'] == 'Sebastian Vettel')]
seb_vettel
Out[ ]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
3436739 Sebastian Vettel 20 2010 red_bull 101126.229108 NaN 201.130300 1.943662 3.001878 5.0
3499434 Sebastian Vettel 20 2011 red_bull 105831.397590 22273.000000 200.172277 1.271548 1.559778 11.0
3565169 Sebastian Vettel 20 2012 red_bull 101035.189329 21982.568182 198.304904 5.036145 4.196213 5.0
3634699 Sebastian Vettel 20 2013 red_bull 98646.623214 22223.113636 199.887430 2.033036 1.588393 13.0
3702207 Sebastian Vettel 20 2014 red_bull 105008.240741 52057.317073 197.300866 7.573045 4.817901 0.0
3758559 Sebastian Vettel 20 2015 ferrari 97092.955634 24540.405405 197.653485 5.275954 3.046140 3.0
3826904 Sebastian Vettel 20 2016 ferrari 102160.181646 160969.750000 197.123366 6.129245 3.921698 0.0
3891520 Sebastian Vettel 20 2017 ferrari 92791.092746 60077.857143 208.748014 3.184404 2.473394 5.0
3958482 Sebastian Vettel 20 2018 ferrari 92496.365385 24363.551724 209.709708 2.893429 2.999199 5.0
4034278 Sebastian Vettel 20 2019 ferrari 91593.834465 24942.513514 209.795354 4.456706 4.787776 1.0
4105496 Sebastian Vettel 20 2020 ferrari 96365.301969 148039.129032 214.892423 11.836980 10.206783 0.0
4161868 Sebastian Vettel 20 2021 aston_martin 97751.538401 232763.860465 209.312232 11.346793 9.992874 0.0
4239643 Sebastian Vettel 20 2022 aston_martin 99003.257039 162892.121212 200.975655 12.933697 10.574932 0.0
In [ ]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Sebastian Vettel's Yearly Averages (centered at 2015 & 2021)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2015, color = 'green', linestyle='-.')
ax.axvline(x = 2021, color = 'red', linestyle='-.')
seb_vettel.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.41.34 PM.png

When Sebastian switched from RedBull (2) to Ferrari (4), his average lap time, average pitstop time decreased and his fastest lap speed remained unaffected. When Sebastian switched from Ferrari (6) to Aston Martin (unranked) his average lap time, average pitstop time, and average fastest lap speed decreased.

Screen Shot 2022-12-15 at 11.47.55 AM.png

In 2012 Lewis Hamilton was racing for McLaren, who was ranked 3rd overall. In 2013, he switched teams and raced for Mercedes, who was ranked 5th in 2012. We want to see if Lewis's average lap time, average pitstop time and fastest Lap speed was affected by switching to a 'lesser' team.

In [ ]:
hamilton = results_recent.loc[(results_recent['Driver'] == 'Lewis Hamilton')]
hamilton
Out[ ]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
7330335 Lewis Hamilton 1 2010 mclaren 101991.976024 NaN 200.273699 5.179641 3.648703 3.0
7388849 Lewis Hamilton 1 2011 mclaren 99405.792695 22666.446429 197.448912 3.648569 3.548865 3.0
7448040 Lewis Hamilton 1 2012 mclaren 99488.327273 22671.861111 197.198976 4.449331 4.443595 4.0
7509032 Lewis Hamilton 1 2013 mercedes 99398.292435 22399.111111 197.897322 3.437269 5.154982 1.0
7574672 Lewis Hamilton 1 2014 mercedes 102033.639391 53061.974359 196.309225 4.821123 1.388202 11.0
7637687 Lewis Hamilton 1 2015 mercedes 96220.252679 23666.000000 199.423454 1.493750 1.766964 10.0
7705961 Lewis Hamilton 1 2016 mercedes 101508.330523 151195.581395 200.290594 3.914912 1.919966 10.0
7778242 Lewis Hamilton 1 2017 mercedes 94618.848536 59090.972222 207.880191 4.039331 2.882008 9.0
7851219 Lewis Hamilton 1 2018 mercedes 92570.843825 23008.192308 209.491248 2.807968 1.898008 11.0
7927706 Lewis Hamilton 1 2019 mercedes 91899.344691 25168.823529 210.759868 2.332013 2.445325 11.0
8005168 Lewis Hamilton 1 2020 mercedes 97792.070526 196493.419355 221.589408 1.904211 1.841053 11.0
8062512 Lewis Hamilton 1 2021 mercedes 96207.248227 214770.595745 213.329920 3.160757 2.669031 8.0
8066481 Lewis Hamilton 1 2022 mercedes 97573.824238 144233.815789 206.708411 6.457097 5.238172 0.0
In [ ]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Lewis Hamilton Yearly Averages (centered at 2013)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2013, color = 'green', linestyle='-.')
hamilton.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.49.30 PM.png

Screen Shot 2022-12-15 at 12.03.33 PM.png

Daniel Ricciardo has has spent his relatively short Formula One career split between five teams: HRT, Torro Rosso, RedBull, Renault, and McLaren. For simplicity, I will only analyze the largest jumps of his career: RedBull to Renault. After driving for RedBull for five years, Ricciardo switched in 2019 from a team ranked 3rd to Renauly who was ranked 5th. Although this may not seem like a large jump, the top three teams in Formula One consistently score the best and the remaining seven tend to fall to the wayside. It is uncommon to see a driver in their prime voluntarily switch to a lower team and thus I want to see how Ricciardo's average lap time, average pitstop, and fastest lap times were affected.

In [ ]:
ricciardo = results_recent.loc[(results_recent['Driver'] == 'Daniel Ricciardo') & ((results_recent['year']>=2014) & (results_recent['year']<=2020))]
ricciardo
Out[ ]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins
8902174 Daniel Ricciardo 817 2014 red_bull 102773.415679 54115.675000 184.666441 6.052871 3.257976 3.0
8967455 Daniel Ricciardo 817 2015 red_bull 97161.275000 25076.342105 194.779988 8.193519 7.800926 0.0
9033087 Daniel Ricciardo 817 2016 red_bull 101457.872928 124046.600000 199.839255 3.854775 4.340174 1.0
9111154 Daniel Ricciardo 817 2017 red_bull 96295.504425 63537.593750 208.160422 7.057459 3.185635 1.0
9163489 Daniel Ricciardo 817 2018 red_bull 92463.852998 24094.550000 207.669317 7.828820 3.017408 2.0
9223029 Daniel Ricciardo 817 2019 renault 92436.400000 25684.760000 205.617996 10.615179 7.702679 0.0
9288737 Daniel Ricciardo 817 2020 renault 96597.561798 208195.137931 218.050905 7.628192 6.644535 0.0
In [ ]:
plt.figure(figsize=(20,20))
ax = plt.subplot(5,1,1)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_lap", ax=ax)
plt.title("Daniel Ricciardo's Yearly Averages (centered at 2019)")

ax = plt.subplot(5,1,2, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_pit", ax=ax)

ax = plt.subplot(5,1,3, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_fastest_speed", ax=ax)

ax = plt.subplot(5,1,4, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_start", ax=ax)

ax = plt.subplot(5,1,5, sharex=ax)
ax.axvline(x = 2019, color = 'green', linestyle='-.')
ricciardo.plot.scatter(x="year", y="avg_finish", ax=ax);

Screen Shot 2022-12-15 at 7.54.15 PM.png

Regression Discontinuity Analysis Conclusions

For the most part, the drivers and their switches followed our expectations. When Max Verstappen switched to a 'better' team, his success indicators were immediately affected positively and he continued to have larger positive affect each year. When Daniel Ricciardo switched to a 'worse; team, his success indicators were negatvily affected in the ensuring season.

There were certain anomolies in Lewis Hamilton's switch from McLaren to Mercedes. Although Mercedes was a lower ranking team in the previous season, Lewis's switch had positive affects on his success indicators.

Additionally, Sebestian Vettel's first switch from RedBull to Ferrari did not follow expectations, but his second switch from Ferrari to Aston Martin did follow expectations.

What we can see from this is that constructors do play a role in success, however Formula One is a highly volitile sport and teams can change drastically year to year, espicially ones with a strong historical footprint like Mercedes or Ferrari.

Going forward, I will still incorporate constructor success into my model but I will not rely on it solely.

Screen Shot 2022-12-15 at 11.24.40 PM.png

In [ ]:
results_recent['prev_szn(avg_lap)'] = results_recent['avg_lap'].shift(1)
results_recent['prev_szn(avg_pit)'] = results_recent['avg_pit'].shift(1)
results_recent['prev_szn(avg_fastest_speed)'] = results_recent['avg_fastest_speed'].shift(1)
results_recent['prev_szn(avg_start)'] = results_recent['avg_start'].shift(1)
results_recent['prev_szn(avg_finish)'] = results_recent['avg_finish'].shift(1)
results_recent['prev_szn(constructor)'] = results_recent['constructorRef'].shift(1)
results_recent['prev_szn(points_szn)'] = results_recent['points_szn'].shift(1)

# WE NEED TO GET DUMMIES FOR EACH OF THE INSTRUCTORS 
results_recent.dropna(inplace=True)

results_recent
Out[ ]:
Driver driverId year constructorRef avg_lap avg_pit avg_fastest_speed avg_start avg_finish max_wins constructorId points_szn prev_szn(avg_lap) prev_szn(avg_pit) prev_szn(avg_fastest_speed) prev_szn(avg_start) prev_szn(avg_finish) prev_szn(constructor) prev_szn(points_szn)
77 Romain Grosjean 154 2013 lotus_f1 101362.287643 24448.615385 199.298737 8.629283 6.069574 0.0 208 315.0 98661.940224 24162.448276 195.716262 7.816149 6.604969 lotus_f1 315.0
154 Romain Grosjean 154 2014 lotus_f1 104962.790136 51355.487805 192.494907 15.045121 10.006296 0.0 208 315.0 101362.287643 24448.615385 199.298737 8.629283 6.069574 lotus_f1 315.0
231 Romain Grosjean 154 2015 lotus_f1 97439.547258 25392.606061 188.249763 10.759022 8.215367 0.0 208 315.0 104962.790136 51355.487805 192.494907 15.045121 10.006296 lotus_f1 315.0
308 Romain Grosjean 154 2016 haas 98754.771845 76684.450000 198.282209 14.428295 10.931202 0.0 210 93.0 97439.547258 25392.606061 188.249763 10.759022 8.215367 lotus_f1 315.0
452 Romain Grosjean 154 2017 haas 96862.842816 57919.948718 202.998899 12.949904 10.584778 0.0 210 93.0 98754.771845 76684.450000 198.282209 14.428295 10.931202 haas 93.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
107894 Charles Leclerc 844 2018 sauber 93275.184310 24773.640000 204.704019 12.451887 8.588679 0.0 15 126.0 91219.490566 24628.000000 240.750000 8.000000 9.000000 williams 320.0
108282 Charles Leclerc 844 2019 ferrari 92773.335640 24281.250000 212.168689 3.787197 4.152249 2.0 6 571.0 93275.184310 24773.640000 204.704019 12.451887 8.588679 sauber 126.0
109294 Charles Leclerc 844 2020 ferrari 100032.670316 160831.714286 214.948978 8.012151 6.274605 0.0 6 571.0 92773.335640 24281.250000 212.168689 3.787197 4.152249 ferrari 571.0
110306 Charles Leclerc 844 2021 ferrari 97752.122271 230268.000000 215.072843 6.623365 6.410636 0.0 6 571.0 100032.670316 160831.714286 214.948978 8.012151 6.274605 ferrari 571.0
111318 Charles Leclerc 844 2022 ferrari 97835.053827 129264.767442 209.083621 4.234651 3.190917 3.0 6 571.0 97752.122271 230268.000000 215.072843 6.623365 6.410636 ferrari 571.0

214 rows × 19 columns

In [ ]:
# PICKING THE FEATURES I WANT TO USE IN MY MODEL
features = ["constructorRef", "prev_szn(avg_lap)",
            "prev_szn(avg_pit)", 
            "prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
            "prev_szn(avg_finish)", "prev_szn(constructor)", "driverId", "prev_szn(points_szn)"]
# CHANGING ANY CATEGORICAL VARIABLE INTO A DUMMY
X_dict = results_recent[features].to_dict(orient="records")
# SPECIFYING WHAT MY OUTCOME VARIABLE WILL BE 
y = results_recent["avg_finish"]

# SINCE WE WILL BE USING CROSS VALIDATION, MUST SPECIFY THE PIPELINE 
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=5)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
In [ ]:
# ESTABLISH THE FORMULA WE WILL USE FOR FIVE FOLD CROSS VALIDATION
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_dict, y, 
        cv=5, scoring="neg_mean_squared_error"
    ))
    return mse
    
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
# OUR MODEL ERROR IS MINIMIZED AT A K = 22
Out[ ]:
15     7.986051
19     7.988828
20     7.990149
21     8.007162
18     8.050842
16     8.070854
17     8.146174
22     8.168613
14     8.169887
10     8.193201
24     8.212788
26     8.222538
23     8.225122
12     8.236302
11     8.252101
25     8.266042
13     8.269273
9      8.319853
27     8.349731
28     8.379661
29     8.455926
30     8.466280
31     8.481068
32     8.481133
33     8.485076
8      8.554375
34     8.565346
35     8.606605
36     8.725097
37     8.736431
38     8.767447
39     8.769258
7      8.792925
40     8.803312
6      8.824773
3      8.840935
41     8.861058
5      8.880284
42     8.897713
43     8.971718
44     9.047801
45     9.089687
4      9.116851
46     9.149876
48     9.208324
47     9.222640
49     9.271845
50     9.273722
2      9.778606
1     11.155929
dtype: float64

Initially, I thought it would be wise to optimize the KNN using cross fold validation because I wanted to minimize the error in the model, however, I realized with a small sample of drivers and construcotrs, the larger K got, the more similar all of the predicted finishing positions got. I've included this in the project, but wanted to first explain why I would not be using the predicted K going forward

For the training set, I used lagged success indicators for each of the drivers. I also limited the training set to not include 2022, as this would be what I tested my predictions on.

In [ ]:
# ESTABLISHING X TRAIN, Y TRAIN, AND X TEST 
features = ["constructorRef", "prev_szn(avg_lap)",
            "prev_szn(avg_pit)", 
            "prev_szn(avg_fastest_speed)", "prev_szn(avg_start)",
            "prev_szn(avg_finish)", "prev_szn(constructor)", "year", "driverId", "prev_szn(points_szn)"]
X_train = pd.get_dummies(results_recent[features])
# I AM INCLUDING 2022 BUT ONLY USING LAGS IN MY MODEL SO THAT THE RACERS IN 2022 CAN STILL PULL THEIR LAG FROM 2021
X_train = X_train.loc[X_train['year']<2022]

y_train = results_recent[["avg_finish", "year"]]
y_train = y_train.loc[y_train['year']<2022]
y_train = y_train.drop(columns=['year'])

X_test = pd.get_dummies(results_recent[features])
X_test = X_test.loc[X_test['year'] ==2022]
In [ ]:
from sklearn.neighbors import KNeighborsRegressor
KNN_model = KNeighborsRegressor(n_neighbors=2).fit(X_train,y_train)
X_test['predicted_avg_finish'] = KNN_model.predict(X_test) 
predictions = X_test[['driverId', 'prev_szn(avg_finish)', 'predicted_avg_finish']]
predictions
Out[ ]:
driverId prev_szn(avg_finish) predicted_avg_finish
4696 825 12.114251 6.816583
5272 854 15.938380 7.781293
11215 855 11.772856 6.859850
18334 822 4.133040 12.016925
20285 840 9.663535 9.396364
28208 20 9.992874 10.848718
31565 807 7.469027 8.209537
32896 842 7.532860 6.859850
34845 852 10.988889 6.859850
43542 4 8.425617 6.859850
44497 839 8.862966 6.859850
52111 1 2.669031 12.074952
55062 847 11.498685 10.326212
60731 817 8.211523 6.859850
62576 848 7.556338 7.952948
66524 815 5.759056 7.952948
69489 830 1.797030 15.824931
82521 846 6.257563 10.848718
87798 832 6.224111 6.859850
105461 849 14.084732 10.848718
107083 856 16.000000 13.049505
111318 844 6.410636 10.848718
In [ ]:
# MERGING IN DRIVER NAME AND ACTUAL FINISH FOR 2022
results_2022 = results_recent.loc[results_recent['year']==2022] 

predictions = predictions.merge(results_2022[['Driver', 'driverId', 'year', 'avg_finish']], how="right", on = ['driverId'])
predictions
Out[ ]:
driverId prev_szn(avg_finish) predicted_avg_finish Driver year avg_finish
0 825 12.114251 6.816583 Kevin Magnussen 2022 12.198588
1 854 15.938380 7.781293 Mick Schumacher 2022 12.958880
2 855 11.772856 6.859850 Guanyu Zhou 2022 12.016023
3 822 4.133040 12.016925 Valtteri Bottas 2022 9.051937
4 840 9.663535 9.396364 Lance Stroll 2022 10.861426
5 20 9.992874 10.848718 Sebastian Vettel 2022 10.574932
6 807 7.469027 8.209537 Nico Hülkenberg 2022 14.663551
7 842 7.532860 6.859850 Pierre Gasly 2022 10.522231
8 852 10.988889 6.859850 Yuki Tsunoda 2022 10.888385
9 4 8.425617 6.859850 Fernando Alonso 2022 7.827370
10 839 8.862966 6.859850 Esteban Ocon 2022 7.876106
11 1 2.669031 12.074952 Lewis Hamilton 2022 5.238172
12 847 11.498685 10.326212 George Russell 2022 4.205479
13 817 8.211523 6.859850 Daniel Ricciardo 2022 10.874582
14 848 7.556338 7.952948 Alexander Albon 2022 11.992740
15 815 5.759056 7.952948 Sergio Pérez 2022 3.914624
16 830 1.797030 15.824931 Max Verstappen 2022 2.694729
17 846 6.257563 10.848718 Lando Norris 2022 7.183559
18 832 6.224111 6.859850 Carlos Sainz 2022 3.343961
19 849 14.084732 10.848718 Nicholas Latifi 2022 14.526407
20 856 16.000000 13.049505 Nyck de Vries 2022 9.000000
21 844 6.410636 10.848718 Charles Leclerc 2022 3.190917

We know that George Russel switched from Williams to Mercedes in 2022. We will use his data from 2022 and see what our model predicts his average finishing position to be. Our model predicted that Russell would average a finishing position of 10.32 but he actually finished with average position of 4.205. This could be attributed to his really high average finishing position in the previous season. Another switch was Vatteri Battas who transferred from Mercedes to Alfa. Our model predicted that he would have an average finishing position of 12.01 but he actually finished with an average finish of 9.05.

Final Steps

Screen Shot 2022-12-15 at 11.37.14 PM.png

All in all, I have come to the conclusion that it is very difficult to finetune finishing position. Given the fact that there are only 20 racers within a given season, KNN cannot utilize a large K. Additionally, the progress that each teams makes season-to-season, makes it very difficult to make predictions without having specific data on their technology or cars.

In [ ]:
# Converting my CoLab to html so that I can upload this file to Github
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/finaltutorial.ipynb
[NbConvertApp] Converting notebook /content/drive/MyDrive/finaltutorial.ipynb to html
[NbConvertApp] Writing 6691162 bytes to /content/drive/MyDrive/finaltutorial.html
Out[ ]: